
Learning Theory

Supervised Learning

Objective

For an unknown \(c\in\mathcal{C}\), given

  • \(x^{(1)},\ldots,x^{(m)}\) (“training data”), where each \(x^{(i)}\) is drawn i.i.d. from a distribution \(D\)
  • \(c(x^{(1)}),\ldots,c(x^{(m)})\) (“labels”)

find a hypothesis \(h(x)\) such that

  • \(h\in\mathcal{C}\)
  • \(\forall i\; h(x^{(i)}) = c(x^{(i)})\)

Equivalently, define

\[\mathrm{Cons}_{\mathcal{C}}(X) := \lbrace h\in\mathcal{C} : \forall i\; h(x^{(i)}) = c(x^{(i)})\rbrace \]

and view learning as a relation/search problem: find \(h\in \mathrm{Cons}_{\mathcal{C}}(X)\).
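As a concrete illustration, here is a minimal Python sketch of this consistency/search view for a finite hypothesis class (the names `hypotheses`, `examples`, `labels` are illustrative, not from the lecture):

```python
def is_consistent(h, examples, labels):
    """True iff hypothesis h agrees with the target's labels on every training example."""
    return all(h(x) == y for x, y in zip(examples, labels))

def cons(hypotheses, examples, labels):
    """Cons_C(X): the members of a (finite) class C consistent with the labeled sample."""
    return [h for h in hypotheses if is_consistent(h, examples, labels)]
```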

(Topics listed in lecture: Supervised Learning, Occam’s Razor, Elimination Algorithm.)

Example: Databases

Let \(\mathcal{C}\) be “all databases” (i.e., all possible properties).

A naive algorithm:

  • If we observe \(c(x^{(i)}) = 1\), add \(x^{(i)}\) to our database hypothesis \(h\).

Question: how does this \(h\) perform on unobserved examples?
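A minimal sketch of this naive memorization learner (illustrative names, not lecture code):

```python
def naive_learn(examples, labels):
    """Memorize the positively labeled examples as the hypothesis 'database'."""
    return {x for x, y in zip(examples, labels) if y == 1}

def naive_predict(h, x):
    """Predict 1 exactly on the memorized examples, 0 on everything else."""
    return int(x in h)
```

By construction this \(h\) outputs 0 on every example it has not seen, so the question above is really about generalization beyond the training data.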

PAC Learning

We say \(\mathcal{C}\) is PAC-learnable if there exists an algorithm such that for every \(c\in\mathcal{C}\) and every distribution \(D\), given parameters \(n,\varepsilon,\delta\) it

  • uses \(m = \mathrm{poly}(n, 1/\varepsilon, 1/\delta)\) labeled examples drawn i.i.d. from \(D\), and
  • runs in \(\mathrm{poly}(n, 1/\varepsilon, 1/\delta)\) time, and
  • with probability at least \(1-\delta\) (“Probably”) returns \(h\in\mathcal{C}\) such that \(\Pr_{x\sim D}[h(x)=c(x)] \ge 1-\varepsilon\) (“Approximately”).
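A hedged sketch of this protocol in Python, assuming black-box callables `learner`, `target` (the unknown \(c\)), and `sample` (a draw from \(D\)); these interfaces are illustrative, not from the lecture, and the true error is replaced by a Monte Carlo estimate:

```python
def pac_trial(learner, target, sample, m, eps, n_test=100_000):
    """One run of the PAC protocol: draw m labeled examples, learn, estimate error.

    learner(data) returns a hypothesis h; target(x) plays the unknown c(x);
    sample() draws one x from D.  All three are assumed interfaces.
    """
    data = [(x, target(x)) for x in (sample() for _ in range(m))]
    h = learner(data)
    # Monte Carlo estimate of Pr_{x~D}[h(x) != c(x)].
    errors = sum(h(x) != target(x) for x in (sample() for _ in range(n_test)))
    return errors / n_test <= eps
```

PAC-learnability asks that, for suitable \(m=\mathrm{poly}(n,1/\varepsilon,1/\delta)\), such a trial succeeds with probability at least \(1-\delta\).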

Proof pattern (union bound over “small” hypotheses)

Define \(\hat h\) to be BAD if \(\Pr_{x\sim D}[\hat h(x)\ne c(x)] \ge \varepsilon\).

Key calculation: if \(\hat h\) is BAD, then the probability it is still consistent with \(m\) i.i.d. samples is

\[\Pr[\hat h\in \mathrm{Cons}_{\mathcal{C}}(X)] \le (1-\varepsilon)^m \le e^{-\varepsilon m}.\]

If there are at most \(2^{B(n,m)+1}\) “small” candidate hypotheses (bounded by some description-length term \(B(n,m)\)), then by a union bound the probability that any small BAD hypothesis remains consistent is at most

\[2^{B(n,m)+1}\cdot e^{-\varepsilon m}.\]

Choosing

  • \(k = 1/\ln 2\) (to convert between \(\ln\) and \(\log_2\)), and
  • \(m\) large enough so that the above expression is \(\le \delta\)

gives the standard PAC-style sample bound.
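Concretely, requiring \(2^{B(n,m)+1}\cdot e^{-\varepsilon m}\le\delta\) and solving for \(m\):

\[\varepsilon m \ge \big(B(n,m)+1\big)\ln 2 + \ln\tfrac{1}{\delta} \quad\Longleftrightarrow\quad m \ge \frac{1}{k\varepsilon}\Big(B(n,m) + \log_2\tfrac{1}{\delta} + 1\Big), \qquad k=\frac{1}{\ln 2},\]

which is exactly the form of the conjunction bound given below with \(B(n,m)=2n\).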

Example: Survey / empirical validity

Suppose we ask 20 households, chosen at random from the city, whether they own or rent and what their income is.

Consider the property “income > $300k or rents,” and suppose all 20 survey responses satisfy it.

Let \(A_i\) be the event that the \(i\)-th surveyed household satisfies the property.

Assuming the households are drawn independently,

\(P(\bigcap_{i=1}^{20} A_i)=\prod^{20}_{i=1}P(A_i)=\left(\frac{\#\text{ households satisfying the property}}{\#\text{ total households}}\right)^{20}\)

Now suppose that fewer than 90% of all households satisfy the property.

Then this probability would be \(<0.9^{20}\approx0.12\).

Since all 20 responses did satisfy the property, we can conclude with \(\approx 88\%\) confidence that at least \(90\%\) of households satisfy it.
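A quick numeric check of this calculation (a sketch):

```python
p_all = 0.9 ** 20            # chance all 20 random households satisfy the property if only 90% do
print(round(p_all, 4))       # 0.1216
print(round(1 - p_all, 2))   # 0.88 -- the confidence level quoted above
```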

More Generally

We consider domains corresponding to distributions \(D\) on sets of possible worlds. We say a property \(C\) is \((1-\varepsilon)\)-valid if \(P_D(\text{not } C)\leq \varepsilon\). If we independently draw 20 examples from \(D\), then \(P(\text{all }20\text{ satisfy }C) = P_D(C)^{20} = \big(1-P_D(\text{not }C)\big)^{20}\).
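In particular, if \(C\) is not \((1-\varepsilon)\)-valid, then for \(m\) i.i.d. examples

\[\Pr(\text{all } m \text{ satisfy } C) = P_D(C)^m = \big(1-P_D(\text{not } C)\big)^m \le (1-\varepsilon)^m \le e^{-\varepsilon m},\]

the same quantity bounded in the union-bound proof pattern above.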

Learning Conjunctions

Elimination Algorithm

Initialize \(h(x)\) to contain all literals (so \(2n\) literals total).

For each example \(i\):

  • If \(c(x^{(i)}) = 1\), remove from \(h\) every literal \(\ell\) such that \(\ell(x^{(i)})=0\).

Return \(h(x)\).
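A minimal Python sketch of the elimination algorithm, representing a literal as a pair \((i, v)\) meaning “\(x_i = v\)” (the function and variable names are illustrative):

```python
def eliminate(examples, labels):
    """Learn a conjunction consistent with the labeled examples.

    Each example is a tuple of n bits.  A literal is a pair (i, v):
    (i, 1) means "x_i is true", (i, 0) means "x_i is false".
    """
    n = len(examples[0])
    h = {(i, v) for i in range(n) for v in (0, 1)}   # start with all 2n literals
    for x, label in zip(examples, labels):
        if label == 1:
            # Drop every literal falsified by this positive example.
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

def predict(h, x):
    """h(x) = 1 iff every remaining literal is satisfied by x."""
    return int(all(x[i] == v for (i, v) in h))

# Example: target c(x) = x0 AND (not x2).
X = [(1, 0, 0), (1, 1, 0), (0, 1, 0)]
y = [1, 1, 0]
h = eliminate(X, y)           # h = {(0, 1), (2, 0)}, i.e. x0 AND (not x2)
print(predict(h, (1, 0, 0)))  # 1
```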

Theorem

Elimination runs in \(O(mn)\) time and solves \(\mathrm{Cons}_{\text{conj}}\) (when \(\mathcal{C}\) is the class of all conjunctions).

Proof idea. Initially \(h(x)\) contains all literals. If a literal is false on some positive example, it cannot appear in the target conjunction, so removing it is safe. Conversely, every literal of the target \(c(x)\) is true on every positive example, so it is never removed (we only remove literals on examples with \(c(x^{(i)})=1\)). Consequently the returned \(h\) contains all of \(c\)'s literals, so on any negative example some literal of \(c\) (and hence of \(h\)) is false and \(h\) outputs 0; and \(h\) contains only literals true on every positive example, so \(h\) outputs 1 on all positive examples. Thus \(h\in\mathrm{Cons}_{\text{conj}}(X)\).

A corresponding sample bound (as written in lecture) was of the form \(m = \frac{1}{k\varepsilon}\big(2n + \log_2\tfrac{1}{\delta} + 1\big)\), with \(k=\frac{1}{\ln 2}\). This is the union-bound pattern above with \(B(n,m)=2n\): each hypothesis the algorithm can output is a subset of the \(2n\) literals, so there are at most \(2^{2n} < 2^{2n+1}\) candidates.

Reduction from \(\mathrm{Cons}\) to PAC

Lemma

If there is a PAC-learning algorithm for a class \(\mathcal{C}\), then there is a polynomial-time algorithm for solving the corresponding consistency/search problem \(\mathrm{Cons}_{\mathcal{C}}\) (for appropriately bounded instances).

Proof

Given a consistency instance with examples \(X=\lbrace x^{(1)},\ldots,x^{(m)}\rbrace\), define the distribution

\[D = \mathrm{unif}\lbrace x^{(1)},\ldots,x^{(m)}\rbrace .\]

Run the PAC learner on \((\mathcal{C}, n, \varepsilon=\tfrac{1}{2m}, \delta=\tfrac12)\) using samples drawn i.i.d. from this \(D\).

With probability at least \(1-\delta=1/2\), the learner outputs \(h\) such that

\[\Pr_{x\sim D}[h(x)=c(x)] \ge 1-\tfrac{1}{2m}.\]

But under a uniform distribution over exactly \(m\) points, the error rate is an integer multiple of \(1/m\). If it is \(< 1/(2m)\), then it must be \(0\), hence \(h\) agrees with \(c\) on every \(x^{(i)}\), i.e. \(h\in\mathrm{Cons}_{\mathcal{C}}(X)\).
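The reduction can be sketched in Python, assuming a black-box `pac_learn(draw, n, eps, delta)` interface (hypothetical, not lecture code) that returns a hypothesis with error at most `eps` under the sampled distribution with probability at least `1 - delta`:

```python
import random

def cons_via_pac(pac_learn, examples, labels, n):
    """Solve Cons_C(X) using a black-box PAC learner (assumed interface)."""
    m = len(examples)
    data = list(zip(examples, labels))

    def draw():
        # D = uniform distribution over the m given labeled examples.
        return random.choice(data)

    while True:
        h = pac_learn(draw, n, eps=1 / (2 * m), delta=0.5)
        # Error under D is a multiple of 1/m, so error < 1/(2m) forces error 0:
        # a successful run returns an h consistent with all m examples.
        if all(h(x) == y for x, y in data):
            return h
        # Otherwise this attempt failed (probability at most 1/2); retry with fresh samples.
```

Each attempt succeeds with probability at least \(1/2\), so the expected number of attempts is constant.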

Consequence

Hardness transfers along the reduction from \(\mathrm{Cons}\) to PAC: if \(\mathrm{Cons}_{\mathcal{C}}\) is NP-hard, then PAC-learning \(\mathcal{C}\) is also hard, since the reduction would turn any efficient PAC learner into an efficient randomized algorithm for \(\mathrm{Cons}_{\mathcal{C}}\).

Can PAC but not Consis?

The reduction is phrased as a discrete-gap argument.

  • Run PAC-learning on \((\mathcal C,n,\varepsilon=\tfrac{1}{2m},\delta=\tfrac12)\).
  • Given sample set \(X=\lbrace x^{(1)},\ldots,x^{(m)}\rbrace\), define
\[D' = \mathrm{unif}\lbrace x^{(1)},\ldots,x^{(m)}\rbrace .\]
  • Question: can \(h(x)\) satisfy the PAC guarantee but fail to solve \(\mathrm{Consis}^{B(n)}_{\mathcal C}\)?

If there exists an index \(i\) with \(h(x^{(i)})\ne c(x^{(i)})\), then under \(D'\) we have

\[\Pr_{x\sim D'}[x=x^{(i)}]=\tfrac{1}{m}\]

so the agreement probability satisfies

\[\Pr_{x\sim D'}[h(x)=c(x)] \le 1-\tfrac{1}{m}\]

(i.e., the error is at least \(1/m\)). Therefore, achieving error \(\le \varepsilon=\tfrac{1}{2m}\) forces zero error on \(D'\), hence \(h\) is consistent with all \(m\) examples.
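The “discrete gap” is that under \(D'\) the error can only take values in \(\lbrace 0, \tfrac{1}{m}, \tfrac{2}{m}, \ldots, 1\rbrace\), so

\[\Pr_{x\sim D'}[h(x)\ne c(x)] \le \tfrac{1}{2m} \;\Longrightarrow\; \Pr_{x\sim D'}[h(x)\ne c(x)] = 0,\]

i.e., PAC success at \(\varepsilon=\tfrac{1}{2m}\) already implies consistency.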